Building a WordNet for Sinhala

Authors

  • Indeewari Wijesiri
  • Malaka Gallage
  • Buddhika Gunathilaka
  • Madhuranga Lakjeewa
  • Daya C. Wimalasuriya
  • Gihan Dias
  • Rohini Paranavithana
  • Nisansa de Silva
Abstract

Sinhala is one of the official languages of Sri Lanka and is used by over 19 million people. It belongs to the Indo-Aryan branch of the Indo-European languages, and its origins date back at least 2000 years. It has developed into its current form over a long period of time, with influences from a wide variety of languages including Tamil, Portuguese and English. As for any other language, a WordNet is extremely important for taking Sinhala into the digital era. This paper describes a project to develop a WordNet for Sinhala based on the English (Princeton) WordNet. It explains how we overcame the challenges of adding Sinhala-specific characteristics that Sinhala language experts deemed important, while keeping the structure of the original English WordNet. It also presents the details of the crowdsourcing system we developed as part of the project, consisting of a NoSQL database in the backend and a web-based frontend. We conclude by discussing the possibility of adapting this architecture for other languages and the road ahead for the Sinhala WordNet and Sinhala NLP.
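As a rough illustration of the architecture the abstract mentions, a Sinhala synset could be stored as a document that points back to its Princeton WordNet counterpart. The abstract does not give the actual schema, so the sketch below is only an assumption: the class name `SinhalaSynset`, the field names, the `extra` container for Sinhala-specific attributes, and the sample values are all hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass
class SinhalaSynset:
    """Hypothetical document shape; not the schema used by the authors."""
    pwn_synset_id: str   # assumed key linking to the English (Princeton) synset
    pos: str             # part of speech, e.g. "n" for noun
    lemmas: List[str]    # Sinhala words that share this sense
    gloss: str           # short definition
    # Assumed free-form slot for Sinhala-specific attributes
    extra: Dict[str, str] = field(default_factory=dict)


def to_document(synset: SinhalaSynset) -> dict:
    """Convert the synset to a plain dict, as a document store would accept."""
    return asdict(synset)


# Illustrative values only
synset = SinhalaSynset(
    pwn_synset_id="dog.n.01",
    pos="n",
    lemmas=["බල්ලා", "සුනඛයා"],
    gloss="a domesticated canine (illustrative gloss)",
    extra={"register": "illustrative"},
)

doc = to_document(synset)
print(json.dumps(doc, ensure_ascii=False, indent=2))
```

Keeping the English synset identifier on every record is one simple way to preserve the original WordNet structure while leaving room for language-specific fields.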

Similar resources

Speaker Adaptation Applied to Sinhala Speech Recognition

Sinhala, which is the main spoken language of the majority of Sri Lanka, is an under-resourced language. The Sinhala language is new to the speech recognition research field and faces the problem of not having suitable speech corpora available. For a language like Sinhala, it is essential to find ways of developing good recognition models using fewer data samples. Speaker Adaptive methods prov...


The Interaction of Transitivity Features in the Sinhala Involitive

The Sinhala volitive/involitive contrast is characterized by verb stem and subject case marking alternations, and broadly indicates the volitionality/non-volitionality of the subject, plus other co-varying features. While superficially a high/low transitivity split à la Hopper and Thompson (1980), we argue that the distinction actually emerges from the interaction of just two factors: a realis/...


Automatic Construction of Persian ICT WordNet using Princeton WordNet

WordNet is a large lexical database of the English language, in which nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets). Each synset expresses a distinct concept. Synsets are interlinked by both semantic and lexical relations. WordNet is essentially used for word sense disambiguation, information retrieval, and text translation. In this paper, we propose s...


Query Architecture Expansion in Web Using Fuzzy Multi Domain Ontology

Due to the growth of the web, there are many challenges in establishing a general framework for data mining and retrieving structured data from the Web. Creating an ontology is a step towards solving this problem. The ontology raises the main entity and the concept of any data in data mining. In this paper, we tried to propose a method for applying the "meaning" of the search system, but the problem ...


Building Czech Wordnet

This paper describes the process of building the Czech wordnet. We enumerate the resources and tools used for this purpose and characterize the results obtained so far. There are some problems with Czech as a synthetic language, with its rich inflectional morphology and word derivation. They are mentioned below and some solutions are suggested. The necessary resources for building Czech w...



Publication date: 2014